GMIS: AN EXPERIMENTAL SYSTEM FOR DATA MANAGEMENT AND ANALYSIS

John J. Donovan 
Henry D. Jacoby 

How many people would climb aboard a trans-Atlantic flight if they
thought the airline lacked the capability to process volumes of weather
and traffic data, and to plan a safe route? Not many, for most of us
have come to expect that the very best information processing services
will be applied in this circumstance. Yet public policymakers and corporate
executives are regularly faced with far more complex and serious problems
(perhaps with risks that are less immediate and obvious), and must make
decisions without the capacity to manage and analyze the pertinent
information. This happens for several reasons: circumstances arise unexpectedly,
and under current technology there simply is not time to construct the
necessary software, or decisions may not occur regularly enough to justify
the cost of a normal information management system, particularly when its
useful life may be cut short by changing circumstances. In this paper we
report on an effort to design and implement tools appropriate to this
circumstance.

The system under development is called GMIS (Generalized Management
Information System), and we present the underlying architecture of the
system and its rationale, along with a sample demonstration of its
characteristics. We begin, in Section 1, with a brief history of the effort
and a summary statement of what the system is designed to do. In order
to give a quick summary of how the system works, Section 2 is an overview
of the software architecture; and then Section 3 uses a sample application
to an energy analysis problem in order to describe how the system is used
and what some of its more important features are. For the reader interested
in details we return in Section 4 to more discussion of the techniques and
methods used in building GMIS. Finally, since this is a report of research
in process, a summary of topics of continuing research is given in Section 5.
1. HISTORY AND PURPOSE OF GMIS

GMIS is being developed at the M.I.T. Energy Laboratory in conjunction
with the Sloan School's Center for Information Systems Research and IBM.
The project started in 1973 based on ongoing research in the Sloan School on
file systems [Madnick, 1970] and operating systems [Donovan, 1972; Madnick
and Donovan, 1974]. However, it has been the urgency of particular
applications to energy problems that has shaped the work and quickened its pace.

During the energy crisis of the winter of 1973/74, policymakers in
New England were handicapped by a lack of information about the region's
energy economy. In response to this circumstance, the New England Regional
Commission (NERCOM) initiated a project to develop a New England Energy
Management Information System (NEEMIS). The initial plan was to develop a
"crisis management" system to assist in the handling of fuel oil allocation,
but over time (though the original function remains an important one) the
needs have grown and the emphasis shifted. Problems of the economic impact
of high oil prices have taken on more importance along with policies and
programs to foster energy conservation. New issues have arisen concerning
the location of major energy facilities, bringing a need for analysis of
associated economic and environmental issues.

Growing experience with the data also brought more demands on the 
system design. The data are of varying quality; data collection procedures 
are changing over time, with series being dropped and added and definitions 
being revised. The requirements for protection have proved complex, for 
they vary with levels of aggregation and time. (For example, an oil company
may be willing to give out data on its aggregate transactions, but not on
details that may help a competitor.) Finally, the need for a facility to
apply various analytical models to the data has become more apparent.

In this circumstance, our approach has been to develop a general set
of tools for speedy construction and easy modification of management
information systems. Essentially, the need is for a software facility suitable
for situations where the problem addressed is constantly changing, or
where an information system is in its formative stages and users are
unable to specify exactly what they want the system to do, or precisely
what the data streams will look like in the future.

To meet these requirements, certain characteristics of the system
seem essential: it needs to be multi-user and interactive; it should be
capable of storing, validating, and retrieving data; and it ought to have
the capability to respond to changing data and data structure, and to
varying protection requirements. It should provide tools for constructing
analytical and statistical models to be applied to the data, but a facility
to construct these models from scratch appears insufficient. Many
economists and modelers have strong preferences for particular modeling
facilities such as TROLL [NBER, 1973], XSIM [Dynamics Association, 1974],
TSP [Hall, 1975], PL/I, EPLAN [Schober, 1974], and FORTRAN; large
investments have been made in packages using these languages, and access to these
facilities can save tremendous costs in retraining personnel and converting
existing models.

Existing commercial data base systems (e.g., IMS [IBM, 1968], DBTG
[Association for Computing Machinery, 1971], System 2000 [MRI Systems, 1974],
and TOTAL [Cincom Systems, Inc., 1974]) have proved their usefulness
in particular applications. But none has the range of desired characteristics
outlined above. Some are lacking the statistical and modeling packages,
not all are interactive, and not all can allow multiple users to access
the same data base. Most important, none was designed for a changing
environment. As detailed below, the GMIS system has taken a long step in
this direction. Using this facility, it is possible to construct an
information system in a matter of days. For example, in the course of work
on the NEEMIS System, changes in the New England energy situation made it
necessary to reconstruct the entire data base five times in one month
during the summer of 1975: once to incorporate additional data in existing
data series, twice for efficiency reasons, and twice because new data and
models had to be added as new problems became apparent.

In the sections that follow, we give a brief overview of the
architecture of the GMIS system and then illustrate the system characteristics
by means of an example drawn from one of its energy applications. For
the reader interested in the details of software design, the discussion
goes on to cover more of the details of the system and its various components.
Since the discussion cannot cover all aspects of the system, however, it
is useful to summarize the requirements that the GMIS system has been
designed to meet. First, in the area of data management the current system
has the following features:

- it allows on-line interactive data management as well as
  a batch facility;

- it allows for storage of large quantities of various types
  of data;

- it allows the changing of data, addition of new data series,
  and modification of tables (data structures);

- it gives the user a simple and consistent view of the way
  data is stored in the system;

- it permits several users to select and access data according
  to many criteria, as it is impossible to specify in
  advance all the ways the data will be used;

- it allows for easy viewing of data, and contains facilities for
  validation of data;

- it provides facilities to interactively change data
  protection;

- it is able to store data about data (e.g., confidence
  levels);

- it provides a mechanism for assuring the integrity of the
  data; and

- it provides mechanisms for monitoring and tuning performance.

The modeling and analytical capabilities introduce several additional
features. Since GMIS provides access to such facilities as APL, PL/I,
TSP, EPLAN, and FORTRAN, it provides the user with an efficient, flexible
environment to specify, construct, and execute statistical analyses and
model studies, and to produce the associated plots and reports.
2. OVERVIEW OF THE SYSTEM ARCHITECTURE

Currently GMIS is implemented on an IBM System/370 computer. It uses
the Virtual Machine (VM) concept extensively.* A virtual machine may be
defined as a replica of a real computer system simulated by a combination
of a Virtual Machine Monitor (VMM) software program and appropriate
hardware support. For example, the VM/370 system enables a single IBM System/370
to appear functionally as though it were multiple independent System/370's
(i.e., multiple "virtual machines"). Thus, a VMM can make one computer
system function as though it were multiple, physically isolated systems.

A configuration of virtual machines used in GMIS is depicted in
Figure 1, where each box denotes a separate virtual machine. Those
virtual machines across the top of the figure are executing programs that
provide user interfaces, whether they be analytical facilities, existing
models, or data base systems. All these programs can access data managed
by the general data management facility running on the virtual machine
depicted in the center of the page. A sample use of this architecture
might proceed as follows. A user activates a model, say in the APL/EPLAN
machine. That model requests data from the general data base machine
(called the Transaction Virtual Machine, or TVM), which responds by passing
back the requested data. Note that all the analytical facilities and data
base facilities may be incompatible with each other, in that they may run
under different operating systems. The communications facility between
virtual machines in GMIS is described in Section 4.1.1. Extensions to
this architecture to allow interfaces to other data base systems and other
computer systems are discussed in Section 4.1.2.

* The VM concept is presented in several places [Parmelee, 1972; Madnick and
Donovan, 1974; and Goldberg, 1973], and many of its advantages are
articulated elsewhere [Madnick, 1969; Buzen et al., 1973]. The concept of
"virtual machines" has been developed by IBM to the point of a production
system release, VM/370 [IBM, 1972].

[Figure 1 diagram. The legible labels are the virtual machines VM(2), VM(3),
VM(4), VM(5), and VM(n), and the interfaces TRANSACT, APL/EPLAN, high-level
language (e.g., PL/I, FORTRAN), TSP, and a customized interface.]

Figure 1: Overview of the Software Architecture of GMIS

GMIS software has been designed using a hierarchical approach [Madnick, 1975,
1970; Dijkstra, 1968; Gutentag, 1975]. Several levels of software exist,
where each level only calls the levels below it. Each higher level
contains increasingly more general functions and requires less user
sophistication for use. The transaction virtual machine depicted in Figure 1
shows only two of these levels, the Multi-User Interface and SEQUEL
[Chamberlain, 1974]. The data base capabilities of this machine are based
on the relational view of data [Codd, 1970]. In this section, each box
will be briefly described. In Section 4 we return to describe some of
the technologies used in implementing these boxes.

2.1 Structured English Query Language (SEQUEL) 

We felt that the data management system would best be based on the
relational model and hierarchical construction as this offered data
independence, integrity, and a framework for reducing complexity. As
part of our research on this topic, we proceeded with an implementation of
an M.I.T. relational system [Smith, 1974]. However, in the current
version of GMIS the data management capability is based on an experimental
relational query and data definition language known as SEQUEL which has
been developed at the IBM San Jose Research Laboratory [Chamberlain, 1974].
In cooperation with the IBM Cambridge Scientific Center and the
IBM Research Laboratory at San Jose, we have extended this
experimental system by easing restrictions on the data types it could
handle and relaxing constraints on the number of columns allowed in a
table, and by increasing the allowable lengths of identifiers and
character strings. We also designed mechanisms for security and for handling
missing data, expanded the bulk loading facilities, added additional
syntax, and made several changes to improve performance.

2.2 Multi-User Transaction Interface

Two requirements of GMIS are that multiple users be able to access
the same data base and that different analytical and modeling facilities
be able to access the data base all at the same time. For example, one
user may want to build an econometric model using TSP while another user
will request the system to generate a standard report. Still a third user
may want to query the data base from an APL [Iverson, 1962; Pakin, 1972]
environment. These requirements have been met with the design and
implementation of the Multi-User Transaction Interface [Gutentag, 1975]. Each
GMIS user operates in his own virtual machine with a copy of the user
interface he requires. Each user transaction to the data base is written
into a transaction file, and that user's request for processing is sent
to the data base machine (Transaction Virtual Machine) as indicated in
Figure 1. The Multi-User Interface processes each request in a first-in/
first-out (FIFO) order, by reading the selected user's transaction file,
and writing the results to a reply file that belongs to the user. Each
user interface reads the reply file as if the reply had been passed
directly from the data base management system. This procedure is discussed
at greater length in Section 4.1.1 below.
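
The request flow just described can be illustrated with a small sketch in Python. It is not the GMIS implementation: the class TransactionMachine, its method names, and the in-memory dictionaries that stand in for transaction and reply files are all assumptions made for illustration. It shows only the FIFO discipline and the per-user reply files.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


@dataclass
class TransactionMachine:
    """Toy stand-in for the Transaction Virtual Machine (hypothetical class)."""

    execute: Callable[[str], str]  # stand-in for the data base system
    pending: Deque[str] = field(default_factory=deque)  # users awaiting service
    transaction_files: Dict[str, List[str]] = field(default_factory=dict)
    reply_files: Dict[str, List[str]] = field(default_factory=dict)
    served: List[str] = field(default_factory=list)  # order in which users were served

    def submit(self, user: str, transaction: str) -> None:
        # The user interface writes the transaction into the user's own file,
        # then signals that this user has a request waiting.
        self.transaction_files.setdefault(user, []).append(transaction)
        self.pending.append(user)

    def serve_all(self) -> None:
        # Requests are processed strictly in arrival (first-in/first-out) order.
        while self.pending:
            user = self.pending.popleft()
            transaction = self.transaction_files[user].pop(0)
            reply = self.execute(transaction)
            # The reply goes to a file that belongs to the requesting user.
            self.reply_files.setdefault(user, []).append(reply)
            self.served.append(user)


tvm = TransactionMachine(execute=lambda q: f"result of <{q}>")
tvm.submit("tsp_user", "SELECT * FROM CARSALES")
tvm.submit("apl_user", "SELECT MODEL FROM MILEAGE")
tvm.submit("tsp_user", "SELECT MPG FROM CARSALES")
tvm.serve_all()
print(tvm.served)
print(tvm.reply_files)
```

Because each user only ever reads his own reply file, a user interface needs no knowledge of the other users sharing the data base machine.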



2.3 User Interfaces 

GMIS provides the capability for users to write their own interfaces
to communicate with the data base system. TRANSACT is a general user
interface that is designed to process transactions from both typewriter
and CRT terminals. It allows the user to direct transaction output to any
virtual device on the VM/370.

Interfaces to APL, TSP, EPLAN and PL/I are operational and enable 
users to communicate with the Transaction Virtual Machine (Figure 1) 
simultaneously with all other users. An interface to the TROLL econometric 
modeling facility is in the design stage. 

The architecture depicted in Figure 1 also allows the use of any of
these modeling or analytical facilities independent of the transaction
virtual machine. For example, functions may be written in APL to operate
on data stored in the APL workspace. TSP modeling and reporting
capabilities can operate on data stored in TSP's data base. FORTRAN or PL/I
programs can operate on data stored in the virtual machine in which they are
running. It should be noted, however, that not using the general data base
facility seriously inhibits flexibility and makes the algorithms dependent on
the physical organization of the data. More importantly, it hampers the
community of users, who cannot conveniently access the common data base.
3. SAMPLE APPLICATION OF GMIS

To demonstrate the characteristics of the existing GMIS System,
we use an example drawn from work done for the Federal Energy
Administration on the construction of indicators of domestic energy conditions
[M.I.T. Energy Laboratory, 1975].* The object of this particular
indicator was to give a picture of future trends in gasoline consumption.
It was proposed that the indicator be depicted as a series or plot of the
average miles per gallon of each month's new car sales. Policymakers could
note if the average fuel efficiency of new cars was going up or down, hence
reducing or increasing future demand for gasoline.

The indicator is shown in Figure 2. Several points concerning the
figure and its derivation are worth noting:

(1) The plot covers the 15-month period from January 1974 to
    March 1975. It is surprising to find that during the
    "energy crisis," the average miles per gallon of new cars
    sold actually went down! We had initially expected that
    during that time people would have purchased smaller, more
    efficient cars, resulting in an increase in average miles
    per gallon. Why did it go down?

(2) Note that since the graph raises additional questions, it
    becomes necessary, in order to resolve these questions, to
    access and analyze the data in ways not originally
    planned for.

(3) The data from which the graph is derived comes from a
    variety of sources, each using different terminology and
    dissimilar means of presentation.

(4) The data is both numeric and non-numeric (e.g., names
    of models of cars).

* Marvin Essrig is responsible for the initial construction of this example.

The remainder of this section shows how GMIS was used to construct and
analyze this indicator. Two user interfaces of GMIS will be used:

(1) TRANSACT is an interface to the data management
    level (SEQUEL), which includes a Data Definition Language
    (DDL) and Data Manipulation Language (DML). This level
    can be used to:

    - restructure the data,

    - input the data, and

    - query data.

(2) APL/EPLAN* is the analytical, modeling, and statistical
    level, which resides above the multi-user interface
    (Figure 1). EPLAN is a set of routines embedded in APL
    for doing statistical functions and reporting.

3.1 Data Manipulation

An example of creating a table and inserting data into it via
TRANSACT-SEQUEL will demonstrate how a user stores data in GMIS. Note
that all data are viewed as residing in tables, as in the relational model
of data [Codd, 1972]. The tables have columns whose entries come from sets
of elements called domains. Figure 3 is an example of a table.

* EPLAN is now available as an IBM product under the name "APL
Econometric Planning Language" [IBM, 1975].
                     columns

   Model       Date     Sales      MPG
   --------    ----    ------     ----
   Cadillac    1/74     9,948     10.9
   Vega        1/74    33,600     30.2      entries
   Pinto       1/74    35,531     28.0
   Pontiac     1/74    10,170     13.8

Figure 3. Sample Table



3.1.1 Data Definition Facility 

A data structure is created in TRANSACT by using SEQUEL commands,* by
first defining the desired domains, then declaring a group of columns
to be a table, and finally, inserting data into the table.

The interactive session to create the table presented in Figure 3
is found in Figure 4, where the shaded commands are user input.
The first four commands establish the existence of the four domains:
model, vol, date, and mpg. The domain 'model' will hold information
stored as characters, while 'date', 'vol', and 'mpg' will consist
of numeric data.

* A complete syntax description of TRANSACT and SEQUEL commands is
available in a GMIS Primer [M.I.T. Energy Laboratory, 1975].
[Figure 4 console session. The shaded user commands are not legible in this
copy. The system responses that remain are, in order: DOMAIN DEFINITION WAS
SUCCESSFUL (four times); TABLE DEFINITION WAS SUCCESSFUL; INSERTION WAS
SUCCESSFUL; a listing under the headings MODEL, DATE, SALES, MPG showing the
row VEGA, 7401, 38455; UPDATE WAS SUCCESSFUL; and a second listing showing
VEGA, 7401, 33600, 302.]

Figure 4. Example of Table Creation and Data Entry
The next command creates a table called CARSALES. The first column
is labelled MODEL, and entries in this column will be classified as
belonging to the set (or domain) model. The other three columns
are defined in a similar fashion, where entries in the column SALES are
the volume of cars sold during the month entered in the column DATE.

The INSERT statement of Figure 4 results in the insertion of one
entry into each column of table CARSALES. The SELECT * command results
in the printing of all entries in table CARSALES. The UPDATE command
results in changing one entry in the table. Note that the change is
reflected in the output from the next SELECT command.
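
For readers without access to SEQUEL, the sequence of table creation, INSERT, SELECT, UPDATE, and SELECT can be approximated with a present-day relational engine. The sketch below uses Python's standard sqlite3 module and modern SQL, not the SEQUEL syntax of Figure 4, which is not legible in this copy. SQLite has no separate domain definitions, so the column types stand in for them; that substitution is an assumption of the sketch. The row values are those read from Figure 4.

```python
import sqlite3

# An in-memory data base stands in for the GMIS data base machine.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Table definition: column types take the place of SEQUEL domains.
cur.execute(
    "CREATE TABLE CARSALES (MODEL TEXT, DATE INTEGER, SALES INTEGER, MPG INTEGER)"
)

# Insert one entry into each column (MPG is held in tenths, as in Figure 4).
cur.execute("INSERT INTO CARSALES VALUES (?, ?, ?, ?)", ("VEGA", 7401, 38455, 302))
before = cur.execute("SELECT * FROM CARSALES").fetchall()

# Change one entry, then list the table again to see the change.
cur.execute("UPDATE CARSALES SET SALES = ? WHERE MODEL = ?", (33600, "VEGA"))
after = cur.execute("SELECT * FROM CARSALES").fetchall()

print(before)
print(after)
```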

3.1.2 Bulk Loading Facility

Suppose that a great deal of data were to be loaded into table
CARSALES. Inputting it via the console, as in the previous example,
would be prohibitively slow and costly. A bulk loading facility has
been implemented to address this problem. A series of data cards and
their appropriate header cards for input into the bulk loader are
shown in Figure 5. The bulk loader will accept these cards, define the
indicated domains, create the table, and insert the data into the
appropriate columns of the table. For a complete explanation of formats and
uses of the bulk loader, see the "GMIS Primer" [M.I.T. Energy Laboratory,
1975].
[Figure 5 card file "carsales data", reconstructed from a partly illegible
listing. The header cards define four domains (MODEL as CHAR; VOL, MPG, and
DATE as NUM), define the table CARSALES with columns MODEL, DATE, VOLUME,
and MPG drawn from the domains MODEL, DATE, VOL, and MPG, and name CARSALES
and its four columns as the target of the load. The data cards follow; MPG is
punched in tenths of a mile per gallon, and "?" marks an illegible digit.
Trailer cards mark the end of the load and the end of the input.]

   MODEL          MPG    DATE    VOLUME
   -----------    ---    ----    ------
   CHEVROLET      12?    7401     33108
   CORVETTE       154    7401      2078
   CHEVELLE       179    7401     21175
   CHEVY NOVA     187    7401     21464
   SPORTVAN       152    7401      1370
   MONTE CARLO    149    7401     15668
   CAMARO         179    7401      8787
   VEGA           302    7401     38455
   PONTIAC        138    7401     10170
   GRAND PRIX     103    7401      4042
   FIREBIRD       179    7401      3666
   VENTURA        121    7401      4890
   OLDSMOBILE     110    7401     10533

Figure 5. Example of a File Ready for Bulk Loading
3.1.3 System Inquiry Facility

The TRANSACT-SEQUEL level has a number of "system commands" for
inquiring about tables as opposed to their contents. For example,
Figure 6 demonstrates some of these commands. The first command lists all
tables that have been created. Note that the system created three tables
(INTEGRTY, DOMCAT, and CATALOG) for its own use. The next command lists
information about the table CARSALES, where the system response lists the
name of each column, the domain from which the entries for that column are
taken, and the data type of each column (either "CHAR" or "NUM"). The next
command lists information about domains.

3.1.4 Query Facility*

Figure 7 illustrates queries to the tables that have been created.
All queries start with the word SELECT. The first two queries ask the
system to list the contents of the tables CARSALES and MILEAGE. The rest
of the queries contain a "WHERE" clause which allows the user to select
only data that meet certain requirements. Note that the SELECT command
may be used to specify queries that require data from more than one table.
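
The two kinds of query mentioned here, a selection restricted by a WHERE clause and a query that draws on more than one table, can be sketched with Python's sqlite3 module. The SQL below is modern syntax rather than the SEQUEL of Figure 7, whose commands are not legible in this copy. The CARSALES rows are taken from Figure 5; the two MILEAGE rows are made-up values used only to give the join something to match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE CARSALES (MODEL TEXT, DATE INTEGER, SALES INTEGER, MPG INTEGER);
    CREATE TABLE MILEAGE  (MODEL TEXT, YEAR INTEGER, MPGCITY INTEGER, MPGHWY INTEGER);
    INSERT INTO CARSALES VALUES ('VEGA', 7401, 38455, 302);
    INSERT INTO CARSALES VALUES ('PONTIAC', 7401, 10170, 138);
    INSERT INTO CARSALES VALUES ('CAMARO', 7401, 8787, 179);
    -- Illustrative mileage figures only, not taken from the EPA guides.
    INSERT INTO MILEAGE VALUES ('VEGA', 1975, 22, 29);
    INSERT INTO MILEAGE VALUES ('CAMARO', 1975, 14, 19);
    """
)

# A WHERE clause selects only the rows that meet a requirement.
big_sellers = conn.execute(
    "SELECT MODEL, SALES FROM CARSALES WHERE SALES > ? ORDER BY SALES DESC",
    (10000,),
).fetchall()

# One SELECT can combine data from more than one table.
sales_with_mileage = conn.execute(
    """
    SELECT C.MODEL, C.SALES, M.MPGCITY, M.MPGHWY
    FROM CARSALES AS C JOIN MILEAGE AS M ON C.MODEL = M.MODEL
    WHERE M.YEAR = 1975
    ORDER BY C.MODEL
    """
).fetchall()

print(big_sellers)
print(sales_with_mileage)
```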

* The general form and syntax of the SELECT command is found in the "GMIS
Primer" [M.I.T. Energy Laboratory, 1975].
[Figure 6 console session, legible portions only. The commands typed by the
user are not legible.]

LIST OF TABLES
   INTEGRTY  DOMCAT  CATALOG  CARSALES  MILEAGE  (one further name illegible)

DESCRIPTION OF TABLE CARSALES
   NAME      DOMAIN    TYPE
   MODEL     MODEL     CHAR
   DATE      DATE      NUM
   SALES     VOL       NUM
   MPG       MPG       NUM

LIST OF DOMAINS
   The listing gives a NAME and a TYPE (CHAR or NUM) for each domain. The
   legible names include RELNAME, CNAME, COLNAME, DOMNAME, SYSCHAR, DATE,
   MODEL, VOL, and MPG.

Figure 6. Examples of Inquiries about Tables
[Figure 7 console session, legible portions only. The SELECT commands
themselves are not legible.]

Part of the listing of CARSALES:

   MODEL         DATE    SALES     MPG
   CAMARO        7401     8787    17.9
   PONTIAC       7401    10170    13.8
   GRAND PRIX    7401     4042    10.3

Part of the listing of MILEAGE ("?" marks an illegible entry):

   MODEL      YEAR    MPGCITY    MPGHWY    MPGAVG
   GREMLIN    1975       19        24       21.0
   HORNET     1975       18        24       20.3
   MATADOR    1975       14        19       15.9
   APOLLO     1975       16        21       17.9
   SKYHAWK    1975       19        25       21.3
   CENTURY    1975       16        24       18.8
   LESABRE    1975       12        16       13.5
   ELECTRA    1975       11         ?       12.5

A later query returns the model names MUSTANG, STARFIRE, VALIANT, ASTRE,
MUSTANG, and PINTO, and a final query returns the single value 155.

Figure 7. Sample Table Queries
3.2 Modeling and Analytic Functions

3.2.1 Validating Data

The data for this example indicator came
from many sources. Data in the table CARSALES came from "Ward's
Automobile Reports" [Ward's, 1975]. The data in the table MILEAGE came from
two Environmental Protection Agency documents [EPA, 1974; EPA, 1975];
1974 data was found in the "1974 Gas Mileage Guide for New Car Buyers,"
and the 1975 data was from a similar document entitled "1975 Gas Mileage
Guide for New Car Buyers."

The data stored in the MILEAGE table was entered (using the bulk
loading facility) as it appeared in the 1974 and 1975 EPA documents.
However, inconsistencies resulted from two factors:

(1) Miles per gallon (mpg) for 1974 data was a single number
    averaging city and highway driving, whereas data for 1975 was two
    numbers reflecting both city and highway driving.

(2) There was a 5% change in the method used by the EPA to determine
    the mileage values from 1974 to 1975.

Let us demonstrate the interaction between a modeling facility and
the data base facility by normalizing the data to remove the inconsistency
in (1) above, thus allowing fair comparison between 1974 and 1975 mpg data.
We perform the following three steps using the APL level. (The reader
should keep in mind Figure 1, depicting the relationship between the two
virtual machines, one running APL and the other the Transaction Virtual
Machine.)

(1) Extract data from the data base facility.

(2) Perform a correcting function on it.

(3) Insert the corrected data back into the data base facility.






Figure 8 exhibits the console session to perform the above three
tasks. Our strategy is to convert, for each model, the two 1975 numbers
(mpg in city driving, mpg on the highway) into one comparable to the one
1974 number.

(1) To extract the data (city mpg, highway mpg for each model
    for 1975) we use the QUERY command of Figure 8. The QUERY
    command is a function that has been added to APL to interface
    between the two virtual machines. The APL QUERY function passes
    the given SEQUEL command in quotes to the Transaction Virtual
    Machine. The TVM then gets the data and passes it back to the
    APL workspace and APL prints the names of the vectors passed
    back, in this case MODEL, CITYMPG, and HWYMPG. The software
    mechanisms for accomplishing this communication are transparent
    to a user at the APL level. They are described later in Section 4.1.1.

(2) The following function was performed on city miles per gallon
    and highway miles per gallon to get one value that was consistent
    with 1974 values:

    \[
    \text{Avg. MPG} = \frac{1}{\dfrac{0.45}{\text{HWYMPG}} + \dfrac{0.55}{\text{CITYMPG}}} \tag{1}
    \]

    In Figure 8 function (1) was invoked by typing its name, 'CHANGE'.
    For the reader's information we listed the APL implementation of
    function (1). Note that the APL implementation not only performed
    function (1), but it also created the necessary QUERY command to
    insert the new data back into the data base.
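
As a cross-check on function (1), the following Python sketch recomputes the combined figure for three 1975 models whose city and highway values appear in the MILEAGE listing of Figure 7. It is a restatement of the formula, not a transcription of the APL function CHANGE. The function name combined_mpg and the exact wording of the generated UPDATE statements are assumptions; the APL listing is only partly legible.

```python
def combined_mpg(city_mpg: float, highway_mpg: float) -> float:
    """Weighted harmonic mean: 55% city driving, 45% highway driving."""
    return 1.0 / (0.45 / highway_mpg + 0.55 / city_mpg)


# (city mpg, highway mpg) pairs for 1975, as read from the MILEAGE listing.
mileage_1975 = {
    "GREMLIN": (19, 24),
    "HORNET": (18, 24),
    "MATADOR": (14, 19),
}

averages = {
    model: round(combined_mpg(city, highway), 1)
    for model, (city, highway) in mileage_1975.items()
}

# Assumed form of the statement passed back to the data base for each model.
updates = [
    f"UPDATE MILEAGE SET MPGAVG = {avg} WHERE MODEL = '{model}' AND YEAR = 1975;"
    for model, avg in averages.items()
]

print(averages)
for statement in updates:
    print(statement)
```

The three results agree with the MPGAVG values that are legible in the MILEAGE listing, which supports the reading of function (1) as a 45/55 harmonic weighting of highway and city mileage.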

[Figure 8 console session, legible portions only. The session shows the QUERY
call that returns the vectors MODEL, CITYMPG, and HWYMPG, followed by the
listing of the APL function CHANGE. According to the listing and its
marginal comments, the function assembles the SEQUEL text beginning 'UPDATE
MILEAGE SET MPGAVG' and ending 'AND YEAR = 1975;', computes 0.45÷HWYMPG[I]
and 0.55÷CITYMPG[I] for each model, increments a counter and loops until the
table is finished, concatenates the full UPDATE SEQUEL command (commas
perform concatenation), and passes the query to SEQUEL for insertion of the
corrected values back into the data base.]

Figure 8: Example Cleaning of Data

The reader who is not familiar with APL can use the comments of the
listing. It is not necessary for readers of this paper to thoroughly
understand APL. For those who wish to do so, the references [Iverson, 1962;
Pakin, 1972] can be consulted.

A similar function was applied to correct the 5% difference in data
reporting of (2) above.

3.2.2 Reporting 

A GMIS user has the full reporting capabilities of any of the modeling
or analytical facilities at his disposal. For example, a GMIS user can
employ the APL/EPLAN facility as a report generator and to produce plots.
To produce the indicator plotted in Figure 2, the following steps were
followed.

(1) Use the QUERY command to extract the desired data.

(2) Execute an APL function to calculate the average miles per gallon
    of all cars sold during a given month from the data in the three
    created tables using the following formula:

    \[
    \text{Average MPG, all cars} = \frac{\sum_i \text{Vol}_i \times \text{Mpg}_i}{\sum_i \text{Vol}_i}
    \]

(3) Convert the resulting vector into a time series.

(4) Use the EPLAN plot facility to produce the plot of Figure 2.
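
Step (2) amounts to a sales-weighted average, which can be sketched in a few lines of Python. The function name sales_weighted_mpg is an assumption, and the denominator (total volume) is inferred from a partly illegible formula in the source. The three rows are January 1974 figures for VEGA, PONTIAC, and CAMARO from Figure 5; a real run would cover every model sold in the month.

```python
from typing import Iterable, Tuple


def sales_weighted_mpg(rows: Iterable[Tuple[int, float]]) -> float:
    """Average mpg of all cars sold, weighting each model by its sales volume."""
    rows = list(rows)
    total_volume = sum(volume for volume, _ in rows)
    if total_volume == 0:
        raise ValueError("no cars sold in this period")
    return sum(volume * mpg for volume, mpg in rows) / total_volume


# (sales volume, miles per gallon) for three models in January 1974.
january_1974 = [(38455, 30.2), (10170, 13.8), (8787, 17.9)]

average = sales_weighted_mpg(january_1974)
print(round(average, 2))
```

Weighting by volume is what makes the indicator sensitive to shifts in the sales mix: a month in which large cars gain share lowers the average even if no individual model changes its mileage.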
As was discussed in Section 3, this plot raises several questions.
Why did the average miles per gallon of all cars sold during the months
of the energy crisis go down? We had expected that it would go up because
people would have bought high-mileage cars during a shortage of gasoline.

One possible explanation is that the wealthy were relatively
unaffected by the energy crisis and thus they continued to buy large,
luxurious, lower-mileage cars. This may have resulted in a
disproportionately smaller number of compact, high-mileage cars sold. Another
explanation might be that the car dealers, seeing an end to the popularity
of large cars, lowered prices on these models greatly, thus inducing a
larger than expected sale of these cars. Another is that foreign compacts
(which we did not include) encroached on the sale of American compacts.

In order to resolve these questions, it becomes necessary to access
the data in a different way than we had initially expected. A plot of
the sales of a luxury car (e.g., Cadillacs) and the sales of a compact
(e.g., Valiants) over the same period would indicate how the sales of
these groups behaved during that period.

Again, operating on the modeling level, the following three steps
are taken (the corresponding console session is shown in Figure 9).

(1) Extract the data using QUERY commands.

(2) Convert the data from a vector to a time series using the
    APL DF function.*

(3) Use the EPLAN PLOT function to produce the desired plot.

* In APL all function names, such as DF and PLOT, are underlined, as can be
seen in Figure 9. Since variable names cannot have spaces in them,
underscores also are commonly used to clarify variable names, as has been done
with CADILLAC_SALES in the figure.
Note that the plot has car sales on the vertical axis and months on 
the horizontal axis. The 'o' denotes Cadillac sales, the '*' denotes 
Valiant sales. Figure 9 reveals that the sales of Valiants showed a 
definite downward trend starting from about the fifth month of 1974, 
while the sale of Cadillacs remained relatively constant. 

3.2.3 Modeling 

In recent years increasing emphasis has been placed on the use 
of models to aid in policy decision making. A model is roughly defined 
as an incomplete representation of a system, where the purpose of the 
model governs which elements of a model can be adjusted to simulate a 
real world change in policy. The results of the simulation can then 
be studied and compared with other simulated courses of action before 
a final decision to effect change in the actual system is made. 

Another useful feature of a model is that it serves as a facility 
through which relationships between elements of a system can be explored. 
We can illustrate this capability by performing a simple analysis of the 
data already introduced in this example. Suppose one wanted 
to investigate the mathematical relationship between the average 
miles per gallon of all cars sold in a month and that of all cars sold in 
some previous month. A correlation matrix depicting the strength of the 
relationship between average miles per gallon of all cars sold in a month 
with that of the previous month, and with that of two and three months ago, 
gives an insight into how a mathematical model of this relationship might 
behave. The EPLAN COR and LAG functions have been applied to the available 
data, resulting in the correlation matrix shown in Figure 10. 
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Figure 10: Correlation Matrix 
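
A correlation matrix of a series against its own lags can be sketched in a few lines of Python. This is only an illustration of the idea, not the EPLAN implementation; the function names and the sample series are hypothetical, and the Pearson coefficient is assumed as the measure of correlation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def lag_correlation_matrix(series, max_lag):
    """Correlate a series with itself lagged by 0..max_lag periods."""
    n = len(series) - max_lag
    # Column k holds the series lagged by k periods, aligned on the same months.
    columns = [series[max_lag - k: max_lag - k + n] for k in range(max_lag + 1)]
    return [[pearson(a, b) for b in columns] for a in columns]

# Hypothetical monthly average-mpg values, for illustration only.
avg_mpg = [18.0, 18.4, 18.1, 17.6, 17.9, 17.2, 16.8, 17.0, 16.5, 16.9]
matrix = lag_correlation_matrix(avg_mpg, max_lag=3)
```

Row 0 of the matrix gives the correlation of the current month with lags of zero, one, two, and three months.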






Inspection of Figure 10 reveals that one ought to expect that the 
average miles per gallon of all cars sold in a month is somehow strongly 
related to the average miles per gallon of all cars sold in the previous 
month, but does not appear to be highly correlated with the figures from 
two or three months ago (a correlation coefficient close to +1 is regarded 
as an indication of a strong relationship between two variables, whereas 
a value of 0 indicates a weak relationship). To explore this relationship 
further, an ordinary least squares regression analysis is applied to the 
two variables using the EPLAN REG function (Figure 11). More precisely, 
we seek an equation of the form: 

$$\text{AVG MPG of CARSSOLD}_t = \alpha_0 + \alpha_1 \times \text{AVG MPG of CARSSOLD}_{t-1}$$

The estimated values of the coefficients $\alpha_0$ and $\alpha_1$ from the table in 
Figure 11 are 4.928 and 0.706, with standard errors of 3.2 and 0.2, 
respectively. The fourth column of the figure depicts the T statistic for the 
estimated values of $\alpha_0$ and $\alpha_1$. 



WITH: 

COEF   VALUE     ST ERR    T-STAT 
1      4.92827   3.19856   1.54078 
2      0.70615   0.19011   3.71443 

NO OF VARIABLES ............  1.00000 
NO OF OBSERVATIONS ......... 13.00000 
SS DUE TO REGRESSION .......  1.29413 
SS DUE TO RESIDUALS ........  1.03177 
F-STATISTIC ................ 13.79701 
STANDARD ERROR .............  0.30626 
R*2 STATISTIC ..............  0.55640 
R*2 CORRECTED ..............  0.55640 
DURBIN WATSON STATISTIC ....  1.10338 

Fitted relation: CARSSOLD_t = 4.928 + 0.706 x CARSSOLD_(t-1) 

Figure 11: Sample Regression 
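
The summary statistics in Figure 11 are related to one another by standard least squares identities, and a short Python sketch can check them for consistency. The variable names are ours, and the residual degrees of freedom are assumed to be the number of observations minus the number of variables minus one (for the constant term).

```python
from math import sqrt

# Values as reported in Figure 11.
coef = [4.92827, 0.70615]
st_err = [3.19856, 0.19011]
ss_regression = 1.29413
ss_residual = 1.03177
n_obs = 13
n_vars = 1

# T statistic: coefficient divided by its standard error.
t_stats = [c / s for c, s in zip(coef, st_err)]

# Assumption: one degree of freedom is used by the constant term.
dof = n_obs - n_vars - 1
mean_sq_resid = ss_residual / dof

f_stat = (ss_regression / n_vars) / mean_sq_resid
std_error = sqrt(mean_sq_resid)
r_squared = ss_regression / (ss_regression + ss_residual)
```

Under these assumptions the recomputed T statistics, F statistic, standard error, and R*2 agree with the figure to the precision shown.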

Based on the results of this initial exploration, more complex formulations 
may be devised to help explain the behavior of the sales of different car 
models over this period, and all would be constructed in the manner shown 
in Figure 11. Moreover, once underlying behavioral relations had been 
estimated, it might be desirable to build a simulation model to forecast 
automobile fuel consumption in the future. Once again, all the programming 
tools and higher-order simulation languages could be made 
available through the system outlined in Figure 1, with access to all the 
data and estimated relations produced in the course of the analysis. 
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4. DETAILS OF THE GMIS DESIGN 

There are three basic features of the GMIS system that give it its 
flexibility: (1) an overall system architecture making use of the 
(largely untapped) power of VM, (2) construction of the system within a 
hierarchical framework, and (3) the use of a relational representation 
of data. Section 2 gave a brief introduction to these features, and 
here we discuss the role of each in greater detail. 

4.1 The Use of VM in the Software Architecture 

Through the use of the VM concepts and the proposed architecture of Figure 1, 
a number of the important features of GMIS become possible, or much 
easier to implement: 

(1) Multi-user coordination of access and update to a central data 
base. 

(2) An environment where several different modeling facilities can 
access the same data base. 

(3) An environment where several different and potentially incompatible 
data management systems can all be accessed by the same user 
models or facilities. 

(4) Increased security and reliability [Donovan and Madnick, 1975]. 

VM also has disadvantages, the primary one being the potential increase 
in overhead costs associated with the synchronization and scheduling 
of the VM system. 







Figure 1 depicts a configuration of virtual machines operating on a 
single real computer. At the present time PL/I, FORTRAN, EPLAN/APL, and 
TSP are the only facilities interfaced with the data management system. 
Work is under way to bring TROLL to this status. Some of these modules 
operate under a different operating system but are made to run on the same 
physical machine using VM/370. All the modeling or analytic virtual machines 
may request data from the general data management system. In this section 
we discuss the techniques we used to facilitate the communications between 
these virtual machines, performance analysis, and proposed extensions 
to this architecture. 



4.1.1 Communication between VM's 

As part of the IBM/MIT Joint Study a multi-user interface on the data 
base machine has been implemented [Gutentag, 1975]. This 
interface allows several users (programs running on the VM's) to access the 
single data base system. Note that for this section a distinction is made 
between a human user and a "user" of the multi-user interface, which is 
usually another program. 

Essentially what is needed is a means of passing commands and data 
to the data base machine, returning data, and a locking and queueing 
mechanism. One way to pass data is to use virtual card readers and card 
punches. The data base virtual machine would be in a wait state trying 
to read a card from its virtual card reader; the analytical machine would 
punch the commands to that virtual card reader, where they would be read by 
the data base VM. This mechanism is inefficient, however, and does not allow 
flexible processing algorithms. 

The mechanism implemented in GMIS is as follows (note that this 
mechanism is invisible to a modeler when he invokes the APL/EPLAN level 
command QUERY, as this command automatically invokes the mechanism). Each 
user virtual machine (UVM), which is accessed by logging on to a separate 
account ID under VM/370, sends transactions to the Transaction Virtual 
Machine through a communications facility (described below). The Multi-User 
Interface (MUI) stacks these transaction requests and processes them 
one at a time. The results of each transaction are passed back to the 
virtual machine that made the request through the same communications 
facility. Replies to the transactions may be processed with any software 
interface that is required for the application. The APL/EPLAN interface 
discussed earlier has been implemented in this manner. 

The best way to explain how the MUI works is to follow a user's 
virtual machine's transaction through each processing step. Refer to 
Figure 12 for an illustration of the transaction processing scheme 
described below. Each user virtual machine must have a small virtual 
minidisk attached to it that has been supplied with a multi-write password. 
This password allows more than one virtual machine to link to the disk 
with read/write privileges (otherwise, VM/370 only allows one user at a 
time to link to a disk with writing privileges). 

When a user's virtual machine wants to send a transaction to the data 
base, it writes the transaction onto its multi-write disk in a CMS* file 
that is reserved for transactions (steps 1 and 2 of Figure 12). The user's 
virtual machine must then signal to the MUI that it wants its transaction 
to be processed. This is done by directing the VM/370 Control Program (CP) 
to send all output from the user's virtual card punch to the virtual card 
reader of the Transaction Virtual Machine (TVM). The user's virtual machine 
then punches a single virtual card containing two items of information: 
the ID of its virtual machine, and a code indicating the type of file 
format that the MUI must use when passing the transaction reply back to 
the user virtual machine (step 3). 

* CMS [IBM, 1974] is an operating system commonly run under VM/370. 

Figure 12a. Sending a Transaction Request 

Figure 12b. Returning Data 

Each card punched by a user is actually a request to the MUI to 
process a transaction residing in the user's transaction file. These 
cards are stacked in the card reader of the TVM, and are processed one at 
a time, where the first card stacked is the first to be processed (FIFO) 
(step 4). 

The MUI is always either running in a wait state or processing transactions. 
When a card is received by the TVM's virtual card reader, an interrupt is 
generated that activates the MUI to begin reading from its card reader. 

To read the user's transaction, the MUI must first access the user's 
transaction file. This is done by first linking to the multi-write disk 
of the virtual machine given by the ID on the transaction request card. 
(The multi-write disk is always attached at the same virtual address; in 
the current implementation, disk address 340 is used for all transaction 
files.) The disk is then accessed by the MUI, and its SEQSTAT SEQUEL 
file is read (step 5). It should be noted that the SEQUEL software level 
provides a file reading capability. 

After the transaction has been processed by SEQUEL in the usual 
manner (step 6), the MUI writes the reply on the user's multi-write disk 
in a file called SEQUEL REPLY (step 7). One of several file formats may 
be used, depending on the user's software environment. Three general 
formats have been proposed that will satisfy all currently anticipated 
GMIS requirements. One format is to be read by APL programs, another 
format will be compatible with TROLL files, and a third format will be 
compatible with any language that can process sequential CMS files (e.g., 
PL/I, FORTRAN). The user's transaction request card indicates which file 
format is to be used by the MUI. 

The TVM then punches a virtual card to the UVM to signal completion 
of transaction processing (step 8). Finally, the UVM reads its SEQUEL REPLY 
file, and processes the transaction result in its own environment (step 9). 
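
The nine steps above can be sketched as a small Python simulation. It is only an illustration under stated assumptions: the class and function names are hypothetical, a dictionary stands in for each multi-write minidisk, a FIFO queue stands in for the TVM's virtual card reader, and an arbitrary function stands in for SEQUEL processing.

```python
from collections import deque

class TransactionVM:
    """Stand-in for the Transaction Virtual Machine running the MUI."""

    def __init__(self, process):
        self.card_reader = deque()   # request cards, served FIFO (step 4)
        self.process = process       # placeholder for SEQUEL processing (step 6)

    def run(self, disks):
        """Process all stacked request cards; return IDs in completion order."""
        completed = []
        while self.card_reader:
            uvm_id, reply_format = self.card_reader.popleft()
            disk = disks[uvm_id]                    # link to the UVM's disk
            transaction = disk["transaction"]       # read the request (step 5)
            reply = self.process(transaction)       # step 6
            disk["reply"] = (reply_format, reply)   # write the reply (step 7)
            completed.append(uvm_id)                # completion card (step 8)
        return completed

def send_transaction(tvm, disks, uvm_id, transaction, reply_format):
    """Steps 1-3: write the transaction file, then punch a request card."""
    disks.setdefault(uvm_id, {})["transaction"] = transaction
    tvm.card_reader.append((uvm_id, reply_format))

# Hypothetical usage; upper-casing stands in for real query processing.
disks = {}
tvm = TransactionVM(process=lambda text: text.upper())
send_transaction(tvm, disks, "UVM1", "select model from carsales", "APL")
send_transaction(tvm, disks, "UVM2", "select mpg from carsales", "CMS")
order = tvm.run(disks)
```

Each user machine would then read its own reply entry (step 9). Note that the request card carries only the machine ID and the reply format; the transaction itself travels through the shared disk.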

4.1.2 Extensions of Architecture 

The following three extensions to the architecture of Figure 1 merit 
further investigation. 

Incompatible Data Systems. Figure 13 depicts an extension of the 
architecture that would allow different and perhaps incompatible data 
base systems to be accessed by the modeling facilities. The general data 
base system would act as a catalog for data stored in the decentralized 
system. The data management virtual machine acts as an interface, 
analyzing the data query and funneling it to the appropriate data base 
management system. These mechanisms could be made invisible to the user, 
who can use the system as though he had all the data in one "virtual" 
data base. The implications of this extension for synchronization, data 
updating, and performance must be further researched. 
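
The catalog idea can be illustrated with a brief Python sketch. This is not a description of any implemented GMIS component; the class, its methods, and the backend names are hypothetical and show only how a catalog might funnel a query to the system that holds the requested table.

```python
class Catalog:
    """Hypothetical catalog mapping each table to the system that stores it."""

    def __init__(self):
        self.location = {}   # table name -> backend name
        self.backends = {}   # backend name -> callable(table, request)

    def register(self, backend_name, handler, tables):
        self.backends[backend_name] = handler
        for table in tables:
            self.location[table] = backend_name

    def query(self, table, request):
        """Route the request to whichever backend holds the table."""
        backend_name = self.location[table]
        return self.backends[backend_name](table, request)

# Two stand-in backends; real ones would be separate data base systems.
catalog = Catalog()
catalog.register("relational", lambda t, r: ("relational", t, r), ["CARSALES"])
catalog.register("other", lambda t, r: ("other", t, r), ["FUELPRICES"])
answer = catalog.query("FUELPRICES", "select all")
```

The caller names only the table, so the location of the data stays invisible to the user, as the text proposes.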
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Figure 13: External Architecture 
Standardization of data base systems. It may be useful to place 
user interfaces that are syntactically and semantically equivalent to existing 
data management systems (e.g., IMS, TOTAL) above the general data base 
system of Figure 1. This would allow data to be input and validated in 
a data system with which a user is familiar, and then stored in a 
standardized general data base system. 

Decentralized/centralized data bases. The advantage of decentralized 
data bases is that they are usually maintained by the people who 
are using them. The advantage of a centralized data base is that many 
groups of people can access it. The above architecture may be extended 
to interface not only with data base and modeling systems running in 
other virtual machines, but to other remote computers, including non-IBM 
equipment. The implications of this extension for data updating and 
networking problems must be investigated with further research. 
4.1.3 Degradation of Variable Cost with Multiple VM Operation 

The construction of a system of communicating VM's brings great 
advantages, but these come at the expense of some sacrifice in performance.* 

Various performance studies of VM's are available in the literature 
[Hatfield, 1972; Goldberg, 1974], and we are engaged in a theoretical 
and empirical analysis of the degradation of variable cost performance 
as a function of the number of modeling machines [Donovan, 1975]. The 
direction of this work can be seen by considering a configuration as in 
Figure 1, where several modeling facilities, each running on a separate 
virtual machine, are accessing and updating a data base that is managed by 
a data base management system running on its separate virtual machine. What 
is the degradation of performance with each additional user? What 
determines the length of time the data-base machine takes to process a 
request? What is the best locking strategy? 

An access or update to the data-base machine may be initiated either 
by a user query, which would be passed on by the modeling machine, or by 
a model executing on the modeling machine. In either case, the data-base 
machine, while processing a request, locks out (queues) all other requests. 
The analysis is further complicated by the fact that as some VM's become 
locked, the others get more of the real CPU's time, and therefore 
generate requests faster. However, the data-base VM also gets more of 
the CPU's time, thereby processing requests faster. For example, if there 
are ten virtual machines, each one receives one-tenth of the real CPU. 
However, if seven of the ten are in a locked state, then the remaining 
three each receive one-third of the CPU. Thus, these three run (in real time) 
faster than they did when ten were running. 

* Here we are addressing the issue of variable costs. Later in Section 5.2 
we address the more important issue, fixed costs, for applications like 
those addressed by the GMIS system. 
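
The arithmetic of the ten-machine example can be written out in a few lines of Python. The function names are ours, and the sketch assumes, as the text does, that the real CPU is divided equally among the VM's that are not locked.

```python
from fractions import Fraction

def cpu_share(total_vms, blocked_vms):
    """Fraction of the real CPU received by each running (unblocked) VM."""
    running = total_vms - blocked_vms
    if running <= 0:
        raise ValueError("at least one VM must be running")
    return Fraction(1, running)

def speedup(total_vms, blocked_vms):
    """Real-time speed of a running VM relative to the case with none blocked."""
    return cpu_share(total_vms, blocked_vms) / cpu_share(total_vms, 0)

share_all_running = cpu_share(10, 0)    # one-tenth of the CPU each
share_seven_blocked = cpu_share(10, 7)  # one-third of the CPU each
relative_speed = speedup(10, 7)         # ratio of the two shares
```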

To try to analyze this circumstance for the uses outlined in this 
paper, we have assumed that the virtual speeds of VM's are constant and 
equal. However, when some VM's (including the data-base VM) are allocated 
a larger share of CPU processing power, they become faster in real 
time. We assume that each unblocked VM receives the same amount of 
CPU processing power and at the initial state m machines are running 
(i.e., the data base machine is stopped if no modeling machines are 
making requests). λ is the request rate of each modeling VM when there are 
m VM's running. μ is the service rate at which the data base virtual 
machine is running when there are m-1 modeling VM's and one data base VM 
running. Thus, we may write the relations: 

$$\lambda_i = \frac{m}{m-i+1}\,\lambda, \qquad \mu_i = \frac{m}{m-i+1}\,\mu \qquad (i = 1, 2, \ldots, m)$$

where i is the number of modeling VM's being blocked. Using a birth/death 
process model [Drake, 1967], and using queueing analysis [Little, 1961], 
we get the following for the response time of the model, where $P_i$ is the 
steady state probability that there are i modeling machines waiting, and 
m is the number of modeling machines. 

[The first expression, a sum over i = 0, ..., m-1 involving $P_i$, is not legible in the source.] 

$$T_{\text{wait-for-data}} = \frac{\sum_{i=1}^{m} i\,P_i}{\sum_{i=1}^{m} \mu_i P_i}$$

$$T_{\text{total}} = T_{\text{overhead}} + T_{\text{model}} + T_{\text{wait-for-data}}$$

Figure 14 illustrates the total time to execute three different models 
as a function of the number of modeling VM's. Let us consider some of the 
implications of the above analysis. 
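
To make the wait-for-data term concrete, the following Python sketch computes steady state probabilities for a birth/death chain and then applies Little's result (mean number waiting divided by throughput). Several points are assumptions of this sketch rather than statements from the analysis above: the total request rate in state i is taken to be (m - i) times the per-machine rate, rates in states i >= 1 are scaled by m/(m-i+1), and the rate in state 0 is left unscaled. All function names are hypothetical.

```python
def steady_state(birth, death):
    """Steady state probabilities of a finite birth/death chain.

    birth[i] is the rate from state i to i+1; death[i] is the rate from
    state i+1 back to state i.
    """
    weights = [1.0]
    for b, d in zip(birth, death):
        weights.append(weights[-1] * b / d)
    total = sum(weights)
    return [w / total for w in weights]

def speed_factor(m, i):
    """Assumed real-time speed-up when i modeling VM's are blocked."""
    return 1.0 if i == 0 else m / (m - i + 1)

def wait_for_data(m, lam, mu):
    """Mean time a request spends at the data base machine (Little's result)."""
    # Assumption: each of the (m - i) unblocked modeling VM's issues requests.
    birth = [(m - i) * lam * speed_factor(m, i) for i in range(m)]
    death = [mu * speed_factor(m, i) for i in range(1, m + 1)]
    p = steady_state(birth, death)
    mean_waiting = sum(i * p[i] for i in range(1, m + 1))
    throughput = sum(mu * speed_factor(m, i) * p[i] for i in range(1, m + 1))
    return mean_waiting / throughput

single = wait_for_data(1, 0.01, 0.1)
many = wait_for_data(10, 0.1, 0.1)
```

With a single modeling machine the wait reduces to 1/μ, since no other request can be queued ahead. The sketch is not calibrated against Figure 14 and should not be read as reproducing its curves.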
First, for λ/μ = 0.1, a model executing in a configuration of one modeling 
machine takes 110 units of time to execute. The same model, run in 
an environment of 10 modeling machines all executing similar models, takes 
approximately 135 units of time to execute, a degradation of performance 
of slightly more than 15 percent. Intuitively, λ denotes the speed of the 
modeling machine, and μ is the speed of the data base machine. Thus a 
situation where λ/μ = 0.1 indicates that the data base machine is ten times 
faster than the modeling machine. From the same figure, with a ratio of λ/μ = 1, 
a model executing with a configuration of one modeling machine takes 20 units 
of time, whereas with ten machines the same model takes approximately 80 units 
of time, over four times longer. 

If such a degradation of performance is not tolerable, there are 
several ways to improve performance. The theoretical study would indicate 
that increasing μ for a given configuration helps performance. Practically 
this could be done by changing the processor scheduling algorithm of VM 
so that the real processor was assigned to the data base management VM 
more often, thus speeding it up and increasing μ. 

Observing the equation for $T_{\text{total}}$ above, another way of reducing 
$T_{\text{total}}$ is to reduce $T_{\text{wait-for-data}}$. One way to reduce 
$T_{\text{wait-for-data}}$ is to extend the VM architecture of Figure 1 to allow 
multiple data base machines. In this configuration $T_{\text{wait-for-data}}$ 
would be reduced by locking out all data base machines only when one modeling 
machine is doing a write. For all read requests the multiple data base 
machines would operate without locking. Shared locks between machines would 
have to be created as well as a mechanism for keeping a write request pending 
until all data base machines can be locked. 

A way of improving performance further would be to extend the single 
locking mechanism used in the above multi data base machine configuration 
to handle multiple locks. Locks would be associated with groupings of 
data, e.g., a table. The locking policy would be to have all machines 
locked out of only a portion of the data when one machine was writing into that 
portion. Thus requests could be processed simultaneously for reads into 
tables not being written in and for reads to different tables. Thus 
adding another real processor to the multiple lock VM configuration could 
greatly improve performance. 
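
The multiple lock policy can be sketched as simple bookkeeping in Python. This is an illustration only: the class and method names are hypothetical, locks are kept per table, and a request that cannot be granted simply returns False (a real system would queue it and would need synchronization between machines, which is omitted here).

```python
class TableLocks:
    """Per-table read/write bookkeeping: many readers or one writer per table."""

    def __init__(self):
        self.readers = {}     # table -> number of active readers
        self.writers = set()  # tables currently being written

    def try_read(self, table):
        if table in self.writers:
            return False      # locked out only while this table is written
        self.readers[table] = self.readers.get(table, 0) + 1
        return True

    def try_write(self, table):
        if table in self.writers or self.readers.get(table, 0) > 0:
            return False      # a write waits for readers and other writers
        self.writers.add(table)
        return True

    def release_read(self, table):
        self.readers[table] -= 1

    def release_write(self, table):
        self.writers.discard(table)

locks = TableLocks()
write_granted = locks.try_write("CARSALES")
read_same_table = locks.try_read("CARSALES")   # refused while being written
read_other_table = locks.try_read("MPGTABLE")  # proceeds without waiting
```

A write to one table leaves reads of every other table unaffected, which is the source of the added concurrency discussed above.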

There is a trade-off with the multi-locking scheme between increases 
in overhead time in maintaining multiple locks versus increases in wait 
time for locked data bases. We have not yet extended the theoretical 
analysis to quantify this trade-off. 

Other theoretical extensions and analyses of this synchronization 
model would include extending the model to cover a more common VM operating 
circumstance, namely, that where the GMIS system (multiple modeling 
machines and one data base machine) would have to share the physical 
machine with other users, also executing under VM, e.g., a payroll program 
under VS2 under VM, multiple CMS users, etc. 

In conclusion, we observe that there may be a degradation in 
performance with multiple users but that there are mechanisms for ameliorating 
the effects of this degradation. 

4.2 Hierarchical Approach 

We have used the design and implementation techniques of hierarchical 
decomposition extensively in our implementation. The hierarchical approach 
has been used in operating systems [Dijkstra, 1968; Madnick and Donovan, 19--] 
and in file system design [Madnick, 1970]. The essential idea of this 
approach is to decompose a system into functional levels. Interfaces of 
each level consist of a series of operators. Each level can only call 
levels below it. 

The levels we are using for the GMIS system are the following: 

- a modeling level 
- a data definition and data manipulation language level 
- a relational level (operators) 
- a file system 
- the operating system 

Further decompositions of the file system level and operating system 
level are outlined in [Donovan and Jacoby, 1975] and of the relational 
level in [Madnick, 1975]. 
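
The rule that each level may call only levels below it can be stated as a small Python check. The list mirrors the five levels named above, ordered from lowest to highest; the function name and the level labels are ours.

```python
# GMIS levels ordered from lowest to highest.
LEVELS = [
    "operating system",
    "file system",
    "relational level",
    "DDL/DML level",
    "modeling level",
]

def may_call(caller, callee):
    """True if the hierarchical rule allows caller to invoke callee."""
    return LEVELS.index(callee) < LEVELS.index(caller)

downward = may_call("modeling level", "relational level")
upward = may_call("file system", "relational level")
```

A design review tool could apply such a check to every inter-level call to confirm that the decomposition is respected.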

The key advantage of this approach is that it reduces complexity 
by decomposing the problem into a series of manageable sub-problems. As a 
consequence of this reduction in complexity, the time to implement an 
entire system is greatly reduced. Another advantage is that the efficiency 
of the system can be increased. These improvements in efficiency come 
from the fact that a system so constructed can be analyzed and tuned for 
performance because each level can be thoroughly understood and analyzed. 

For example, as new software algorithms are invented, their place in 
the hierarchy can be identified and they can then be easily incorporated without 
redesigning the entire system. As new hardware technologies become 
operational, their relevance to information systems can be assessed within 
the framework of the hierarchy, and incorporated where applicable. 

Given inherent parallelism in information systems, the hierarchical 
approach also can capitalize on new technologies to increase the performance, 
reliability, and integrity of information systems. An example of such a 
technological development is the advent of low-cost microprocessors. These 
devices (which are the "computers" used in hand calculators) are becoming 
less expensive each year and have the computational capability of many 
standard computers, e.g., arithmetic and logical operations, memory, and 
registers. To capitalize on this new technology, each level of the hierarchy 
could be examined for operators that could be executed asynchronously with 
each other. These operators, as well as the control logic and 
synchronization mechanisms, could be performed by multiple microprocessors. 

Figure 15 depicts an example of such a hierarchical decomposition 
using microprocessors where the vertical stacks of boxes denote requests 
in the form of operators, and each group of horizontal boxes denotes 
microprocessors to perform the desired operation. At the top of Figure 15 
a list of queries enters the system (e.g., the SELECT commands of Figure 7). 
The microprocessor of level i+2 performs the necessary syntactical analysis 
and translation to produce a list of relational operators (operators on 
tables will be discussed in the next section). This list of functions 
composed of relational operators is processed by the microprocessors at level 
i+1. They in turn generate a number of requests to read tables stored in the 
main or secondary memory. Level i receives those requests and generates 
the appropriate operating system functions to fulfill the request. The 
last group of microprocessors performs the desired operating system 
functions and passes back the results to level i. The results are used by 
level i to produce its results, which are then passed up to level i+1 until the 
top level gets all the information to satisfy the query. 

One of the properties of implementation using hierarchical function 
decomposition is that all processors are anonymous and act as interchangeable 
resources (within a function level). Thus, if a processor malfunctions 
or must be removed from service, the system can continue to function 
without interruption. After a reasonable amount of time has elapsed, 
the higher level processors that had generated requests that were being 
performed by the defective processor merely need to reissue the same 
requests. Alternatively, the reissuing of requests could be accomplished 
automatically by the inter-level request query mechanism. 

Although the details are not elaborated in this paper, it can be 
argued that extensive parallelism, throughput, and reliability can be 
attained by means of a multiple processor implementation of the hierarchical 
function decomposition. 

4.3 Relational Technology 

This section presents an intuitive understanding of relational 
operators, of the approach, and its usefulness to information systems of 
the type we address in this paper. 

The language that a user would use to query, insert, and update data 
is called a Data Manipulation Language (DML). The language used to define 
tables, domains, and characteristics of the data is called a Data Definition 
Language (DDL). The user of GMIS can view all data stored in the system 
in the simple form of a table (relation), as in Figure 3. This view of 
data is called the relational model of data [Codd, 1970]. 

If one were to view data as being stored in tables, then the process 
of querying the data could be broken down into two functional levels. The 
first is composed of mechanisms to recognize the constructs of the query 
(e.g., a SELECT command), which takes place at level i+2 in Figure 15, and 
the second is where the appropriate operations are performed on the tables to 
satisfy the SELECT command (level i+1 in Figure 15). 

Part of our research has been to determine the "appropriate" operations 
of level i+1 needed to query, update, and define data. In an early 
implementation of GMIS we implemented twelve operators [Smith, 1975]. These 
operators included those of Codd [Codd, 1970] (in some cases modified for 
use or performance reasons) as well as three additional operators: compaction, 
difference, and ordering. 

4.3.1 Advantages of the Relational Approach 

A very attractive aspect of the relational approach is its clear, 
well-defined interface that fits into the hierarchical approach and hence 
permits the attainment of all the benefits of the previous section. A 
distinction should be made (which is not often made in the literature) 
between the DDL/DML level and the relational operator level. As we shall 
see, the relational model of data allows us to implement an interactive 
DDL/DML easily. We recognize that other data models (e.g., network, 
hierarchical,* or tree structures) could also be used at a lower level to implement 
the same DDL/DML, only not in as satisfactory a manner, and with a certain 
loss of capabilities. 

Our experience in using a relational data base management system is 
that there is a real comparative advantage for its use in systems where 
the logical data structure keeps changing. Its advantage is the low cost 
of adapting to changing data structures and further, in its use in GMIS, 
in not having to redo all existing modeling programs. It has a comparative 
advantage for implementing an interactive DDL/DML. Its comparative 
advantage, in applications where the types of queries are not all defined before 
implementation, lies in the inherent property of allowing selective access 
to any data in the data base. As we will discuss at the end of this 
section, we recognize the present limitations of the relational approach 
and do not necessarily advocate it for all data management applications. 

* The term hierarchical here refers to a tree structure, which is different 
from the "hierarchical" approach. 

4.3.2 Basics of Relational Operators 

Let us take an example and demonstrate two relational operators, 
"restriction" and "projection". Assume that data exists as in Figure 3 
and a query is made, "SELECT the model of car that receives 30.2 miles per 
gallon". The query processor (level i+2 of Figure 15) would translate this 
query into a series of operators on the table CARSALES. Basically, once 
the query is recognized there are two operations that could give the 
desired information: (1) find all entries that have mpg equal to 30.2; 
(2) list the models in those entries. 

Figure 16 demonstrates these two operations on the table. All 
relational operations create new relations. The first operator used is 
called "restriction", whose function can be intuitively defined as, 
"produce a relation containing all elements of a table that match 
particular restricting conditions." Thus, restricting the relation at the 
top of Figure 16 by the condition MPG = 30.2 produces the relation 
containing the single tuple: 

VEGA, 1/74, 33,600, 30.2 
MODEL      YEAR          VOLUME        MPG 
CADILLAC   (illegible)   (illegible)   10.9 
VEGA       1/74          33,600        30.2 
PINTO      1/74          35,531        28.0 
PONTIAC    1/74          10,170        13.(illegible) 
...        ...           ...           ... 

Result of restriction (MPG = 30.2): 

VEGA       1/74          33,600        30.2 

Figure 16: Restriction and Projection Operators 
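
The two operators can be sketched in Python as functions that take a relation and return a new relation. This is an illustration, not the GMIS or SEQUEL implementation: relations are modeled as lists of dictionaries, the function names are ours, projection is assumed to keep only the named columns and drop duplicate rows, and the sample rows are illustrative values in the spirit of Figure 16.

```python
def restrict(relation, predicate):
    """Return a new relation holding only the rows that satisfy predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """Return a new relation with only the named columns, duplicates removed."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result

# Illustrative rows only.
CARSALES = [
    {"MODEL": "VEGA", "YEAR": "1/74", "VOLUME": 33600, "MPG": 30.2},
    {"MODEL": "PINTO", "YEAR": "1/74", "VOLUME": 35531, "MPG": 28.0},
]

# "SELECT the model of car that receives 30.2 miles per gallon"
restricted = restrict(CARSALES, lambda row: row["MPG"] == 30.2)
models = project(restricted, ["MODEL"])
```

Because each operator returns a relation, the output of restriction can be fed directly to projection, which is what makes a query translatable into a series of such operators.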
There have been several experimental implementations of the relational 
view of data, for example, IS6 [Smith, 1974], MACAIMS [Goldstein, Strnad, 
1971], SEQUEL [Chamberlain, 1974], Colard [Bracchi, 1972], and RIL [Fehder, 
1972]. In GMIS we are using an extended version of SEQUEL discussed in 
Section 2.1. 

Our experience leads to several conclusions. From a user view the 
primary advantage of SEQUEL and other relational systems is that they can 
be interactive, and have a simple, consistent way of viewing data. From an 
implementor's view the relational implementation of SEQUEL fits into a 
hierarchical approach, the operations are consistent, and it provides a 
framework in which to examine performance. We recognize the present 
limitations of the experimental SEQUEL for real applications. We list 
some of those here (not as a criticism of the implementors of SEQUEL, for 
their purpose was to demonstrate feasibility, not an operational system) 
to guard against the danger that our enthusiasm for this approach will 
lead to an overoptimistic picture of SEQUEL. 



Some of the extensions we have had to incorporate in order to make SEQUEL
more operational for our applications are the following: (1) Added a
facility for multiple users to access the same data base, (2) Added
interfaces so that users can use a variety of terminals, (3) Modified SEQUEL
to accept the unary + and - operators as prefixes to numeric literals,
and to handle DECIMAL constants, (4) Extended SEQUEL implementation
restrictions on maximum degree of a table, maximum length of an identifier,
and maximum size of a character string constant, (5) Re-wrote output
formats for generality, (6) Implemented a macroprocessor capability that
allows users to write prepackaged series of queries, (7) Made changes to
increase performance, (8) Added the capability to interface modeling and
analytic facilities, (9) Enhanced the bulk loading facility, (10) Designed
mechanisms for handling null or missing data, (11) Designed backup
facilities, (12) Designed security mechanisms, (13) Designed additional
SEQUEL operators (e.g., GROUP BY). The documentation of these changes,
as well as others, is found in a NEEMIS Progress Report [M.I.T. Energy
Laboratory, 1975].
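
Two of the extensions listed above, grouping (13) and the treatment of null
or missing data (10), can be illustrated with a small sketch. It uses
Python's standard sqlite3 module and present-day SQL as a stand-in; it is
not the 1975 SEQUEL syntax or the GMIS design, and the table name, column
names, and values are hypothetical.

```python
import sqlite3

# Hypothetical example data; SQLite and modern SQL stand in for SEQUEL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, month TEXT, volume INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("VEGA", "1/74", 100),
        ("VEGA", "2/74", None),  # missing observation stored as NULL
        ("VEGA", "3/74", 300),
        ("PINTO", "1/74", 200),
    ],
)

# GROUP BY forms one output row per model. COUNT(*) counts every row,
# while COUNT(volume) and AVG(volume) skip rows whose volume is NULL.
rows = conn.execute(
    "SELECT model, COUNT(*), COUNT(volume), AVG(volume) "
    "FROM sales GROUP BY model ORDER BY model"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The design question the sketch raises is the one any null-handling
mechanism must settle: whether a missing value is excluded from an
aggregate (as here) or treated as zero, which would change the average.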
We feel that an operational relational data management facility needs
to be implemented and incorporated into a system that has analytical
capabilities. We strongly believe that such a development must be done in
close cooperation with real applications. Further, we feel that those
applications should be chosen in areas where this technology has a clear
advantage, that is, for systems where the problems keep changing (e.g.,
public policy systems) or where the system is not well-defined (e.g.,
breadboarding systems), and not in application areas that are currently
being satisfactorily met by other approaches.
5. FURTHER RESEARCH

There are several types of research that need to be pursued so that
these tools can be made available at reasonable cost, and so they can be
used in the most effective manner. Some of that further research has
been discussed in the last section.

5.1 Computer and Management Science Research 

Optimal Hierarchical Decomposition. To gain insights as to what would
be the best hierarchical decomposition, research should be undertaken to
define measures that would allow the construction of proofs that a
particular decomposition of a hierarchy is optimal.

Performance. Each level of the hierarchy needs both a theoretical
study and an empirical study. At each level the impact of new operators
should be investigated, along with formalizations for equivalence between
sets of these operators and performance implications of new operators.
Mechanisms for reducing expressions to equivalent but more efficient
expressions should be explored. For example, at the DDL and DML level,
algorithms that heuristically take advantage of certain query patterns to
make subsequent queries more efficient must be studied. At the relational
level, ways of simulating certain relational operations when the full
operator is not called for must be investigated. Theoretical bounds on the
computation of relational operations as a function of the size of tables
must be developed.
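
As a small illustration of reducing an expression to an equivalent but
cheaper one, the sketch below compares two plans for the same query: join
then restrict, versus restrict then join. It assumes a naive nested-loop
join, which makes |R| x |S| comparisons for tables of sizes |R| and |S|; the
function name, table contents, and sizes are hypothetical and say nothing
about how GMIS or SEQUEL evaluates queries.

```python
# Hypothetical sketch: a naive nested-loop join that counts comparisons.

def nested_loop_join(left, right, key):
    """Join two lists of dicts on `key`; return (rows, comparisons made)."""
    rows, comparisons = [], 0
    for l_row in left:
        for r_row in right:
            comparisons += 1
            if l_row[key] == r_row[key]:
                rows.append({**l_row, **r_row})
    return rows, comparisons


# Made-up tables: 20 car models and 100 sales records.
cars = [{"MODEL": f"M{i}", "MPG": 10 + i} for i in range(20)]
sales = [{"MODEL": f"M{i % 20}", "VOLUME": 1000 * i} for i in range(100)]

# Plan A: join everything, then apply the restriction MPG > 25.
joined, cost_a = nested_loop_join(cars, sales, "MODEL")
plan_a = [row for row in joined if row["MPG"] > 25]

# Plan B: apply the restriction first, then join the smaller table.
efficient = [row for row in cars if row["MPG"] > 25]
plan_b, cost_b = nested_loop_join(efficient, sales, "MODEL")

print(plan_a == plan_b, cost_a, cost_b)
```

Both plans return the same rows, but the second makes far fewer
comparisons because the restriction shrinks one input before the join.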

Virtual Machines. On the VM interface level there is need for
investigation of efficient ways VM's can communicate with each other.
On the VM level more knowledgeable processor schedulers need to be developed.
And, as was discussed in Section 4, work must be done on synchronization
and locking policies of multiple VM configurations.
New Technologies. Investigation of the implications of the new
technologies (e.g., memory, networks, and microprocessors) on each level
in the hierarchy is called for.

Query Languages. On the DML level, in addition to the extensions
we have made to the SEQUEL language (e.g., multiuser interface, security,
additional computational capability, handling larger relations, larger
number of entries), new query language constructs ought to be investigated.
Realistic and operational implementations of a relational query language
should be undertaken.

Synchronization and Interlocks. Various interlock mechanisms must be
used in an information system to coordinate various independent update
operations. It is necessary to develop interlock techniques and policies
that lend themselves to a highly decentralized implementation without
adversely affecting performance or reliability. For example, under what
conditions and for how long are the modeling machines locked out of the data
base machine? Is the data base machine just a catalogue for data stored
in the decentralized data base machines? If so, what are the performance
implications of always accessing data stored in a remote machine? Or is
the accessed data brought up to the data base or modeling machine, in which
case what are the updating policies? What sort of hardware can best
support the proposed hierarchical structure and system structure?
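
One simple interlock policy of the kind these questions concern is a
shared/exclusive lock per relation: any number of readers may hold a
relation at once, but an update requires sole access. The sketch below is a
single-process illustration under that assumption, not the GMIS mechanism;
the class `LockTable`, its method names, and the machine names are all
hypothetical.

```python
# Hypothetical sketch of a shared/exclusive lock table (no real concurrency).

class LockTable:
    """Tracks which machines hold shared or exclusive locks on relations."""

    def __init__(self):
        self._readers = {}  # relation -> set of machines with a shared lock
        self._writer = {}   # relation -> machine holding the exclusive lock

    def acquire_shared(self, relation, machine):
        """Grant a read lock unless an update is in progress."""
        if relation in self._writer:
            return False
        self._readers.setdefault(relation, set()).add(machine)
        return True

    def acquire_exclusive(self, relation, machine):
        """Grant an update lock only if nobody else holds the relation."""
        if relation in self._writer or self._readers.get(relation):
            return False
        self._writer[relation] = machine
        return True

    def release(self, relation, machine):
        """Drop whatever lock `machine` holds on `relation`."""
        if self._writer.get(relation) == machine:
            del self._writer[relation]
        self._readers.get(relation, set()).discard(machine)


locks = LockTable()
r1 = locks.acquire_shared("SALES", "model_vm_1")          # granted
r2 = locks.acquire_shared("SALES", "model_vm_2")          # granted
w_blocked = locks.acquire_exclusive("SALES", "loader_vm")  # readers present
locks.release("SALES", "model_vm_1")
locks.release("SALES", "model_vm_2")
w_ok = locks.acquire_exclusive("SALES", "loader_vm")       # now granted
r_blocked = locks.acquire_shared("SALES", "model_vm_1")    # locked out
print(r1, r2, w_blocked, w_ok, r_blocked)
```

Under this policy the modeling machines are locked out exactly while an
update holds the relation; how long that lasts, and whether it is
acceptable, is the performance question raised above.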


5.2 Studies of the Economics of Information System Design 

Traditional measures of performance (e.g., throughput, system
utilization, response time, turnaround time, etc.) are potentially misleading
and may be irrelevant for the class of information systems addressed here.
These measures address themselves only to the variable costs of an
information system. In the development of an information system there are
fixed costs (analysis cost, design cost, implementation cost of the
software, as well as the hardware costs) as well as variable costs (costs of
queries, execution of models and analytical functions). Much more research
is needed on the overall costs of information systems, on more general
concepts of "performance," and on the types of studies that should be
done to match tools to the particular task at hand.

To illustrate the point, take the simple example of the design of an
inventory control system for a large manufacturer on the one hand, and
a system of roughly the same character and complexity to serve federal
energy policy on the other. The costs of developing such systems using
different sets of software tools are illustrated in Figure 17. The
solid lines show the fixed and variable costs of constructing either
of these two systems using a conventional package, say IMS. The dashed
line shows the cost of the same systems with tools such as those provided
by GMIS. For the more flexible GMIS-type system the fixed costs (and
thus the time to build the system) are much lower, but this advantage
comes at the expense of increased variable costs.


Provided the purposes for the two systems are well known and the
operating assumptions are fixed, the two systems break even at Point A.



It is likely that hardware will eventually be developed to support this 
sort of system, and variable costs will be substantially reduced [Madnick, 
1975]. 
[Figure 17: Fixed and variable costs of the two approaches. Solid lines:
system constructed with conventional information management tools. Dashed
line: system constructed with GMIS-type tools.]
If the application anticipates a large volume of queries, as the
inventory example might, then the conventional approach is preferred.

Of course, to the extent that information system purposes and operating
conditions change over time, the fixed costs of each system are multiplied
by some factor, a condition which greatly favors the types of tools
discussed above.
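
The break-even argument can be made concrete with a simple linear cost
model, total cost = (number of rebuilds) x (fixed cost) + (variable cost per
query) x (number of queries). The sketch below is only an illustration of
that arithmetic: the linear form, the function names, and every dollar
figure are assumptions chosen for the example, not data from Figure 17.

```python
# Hypothetical linear cost model; all figures are made up for illustration.

def total_cost(fixed, variable_per_query, queries, rebuilds=1):
    """Total cost when the system must be (re)built `rebuilds` times."""
    return rebuilds * fixed + variable_per_query * queries


def break_even_queries(fixed_a, var_a, fixed_b, var_b, rebuilds=1):
    """Query volume at which systems A and B cost the same.

    Returns None when the cost lines never cross at a non-negative volume.
    """
    if var_a == var_b:
        return None
    volume = rebuilds * (fixed_b - fixed_a) / (var_a - var_b)
    return volume if volume >= 0 else None


# Assumed figures: a flexible system with low fixed and high variable cost,
# and a conventional system with high fixed and low variable cost.
FLEX_FIXED, FLEX_VAR = 100_000, 5.0
CONV_FIXED, CONV_VAR = 500_000, 1.0

point_a = break_even_queries(FLEX_FIXED, FLEX_VAR, CONV_FIXED, CONV_VAR)
point_a_3x = break_even_queries(
    FLEX_FIXED, FLEX_VAR, CONV_FIXED, CONV_VAR, rebuilds=3
)
print(point_a, point_a_3x)
```

With these assumed numbers, requiring each system to be rebuilt three
times triples the break-even volume, so the flexible tools remain cheaper
over a much wider range of query loads.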

The economics of these choices are poorly understood, and the
development of better indices of system "performance" is a high priority
item in information systems research. When these comprehensive indices of
performance are developed, however, we expect that systems like GMIS will
receive high marks for a wide variety of applications. Already the system
is proving its worth in application to New England energy problems, and to
several areas of policy research in the M.I.T. Energy Laboratory. We hope
for continued progress on the issues and problems that remain, and look
forward to a new generation of information management and analysis systems
that are better suited to the fast-moving pace of many corporate and public
problems.
A GMIS-type system may still be a useful tool (as a breadboarding system)
in the optimization of the design of the data management facility, even
with the implementation to be carried out with some other package.